
feat(external): support Hive-style partitioned Parquet external tables#24329

Merged
mergify[bot] merged 6 commits into matrixorigin:3.0-dev from iamlinjunhong:dp-hive
May 12, 2026
Conversation

@iamlinjunhong
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24320

What this PR does / why we need it:

support Hive-style partitioned Parquet external tables


Contributor

@XuPeng-SH XuPeng-SH left a comment


Thanks for the substantial effort here — +6347 lines with 3385 lines of tests, clean file-level decomposition, and real edge-case coverage in the BVT. A lot of it is solid. My concerns focus on whether this is ready to be announced as "Hive-style partitioned Parquet" given how several pieces interact.

Leaving as request-for-changes for the 🔴 items; the 🟡 items are open questions I'd like resolved before merge or explicitly deferred with a follow-up issue.

🔴 Blocking

1. Hive tables are scanned serially regardless of cluster size

compile.go:1675-1679 unconditionally forces param.Parallel = false for Hive tables:

if param.HivePartitioning {
    param.Parallel = false
}

Combined with getReadWriteParallelFlag (:1596-1604), this routes every Hive table into compileExternScanSerialReadWrite at :1862 — a single scope, no cross-CN fanout. A table with 10K partitions = 10K files is opened one at a time on one CN. The inline comment says it's because "the parallel loop mutates param.Filepath per file" — that's a correctness workaround, not a design decision.

This dwarfs every other performance concern. The feature technically works on small datasets but is unusable on anything Hive workloads actually look like. Please either:

  • fix the param.Filepath mutation so parallel paths can be reused, or
  • build a file-level fanout specifically for Hive (the per-partition file set is ideal for this).
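To make the second option concrete, a file-level fanout could shard the discovered parquet files by size across CN slots with LPT (longest-processing-time-first) assignment. This is only a sketch; shard, shardBySize, and the slot model are hypothetical names, not the PR's code:

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

// shard is one CN slot's assigned file set (hypothetical structure).
type shard struct {
	files []string
	bytes int64
}

// shardHeap is a min-heap of shards ordered by assigned bytes.
type shardHeap []*shard

func (h shardHeap) Len() int           { return len(h) }
func (h shardHeap) Less(i, j int) bool { return h[i].bytes < h[j].bytes }
func (h shardHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *shardHeap) Push(x any)        { *h = append(*h, x.(*shard)) }
func (h *shardHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// shardBySize assigns each file, largest first, to the currently
// lightest shard: classic LPT scheduling, which keeps shards balanced
// even with skewed partition file sizes.
func shardBySize(files map[string]int64, slots int) []*shard {
	type f struct {
		name string
		size int64
	}
	fs := make([]f, 0, len(files))
	for name, size := range files {
		fs = append(fs, f{name, size})
	}
	sort.Slice(fs, func(i, j int) bool { return fs[i].size > fs[j].size })

	h := make(shardHeap, slots)
	for i := range h {
		h[i] = &shard{}
	}
	heap.Init(&h)
	for _, file := range fs {
		s := h[0] // lightest shard
		s.files = append(s.files, file.name)
		s.bytes += file.size
		heap.Fix(&h, 0)
	}
	return h
}

func main() {
	shards := shardBySize(map[string]int64{
		"p=1/a.parquet": 10, "p=2/b.parquet": 9, "p=3/c.parquet": 8,
		"p=4/d.parquet": 2, "p=5/e.parquet": 2, "p=6/f.parquet": 1,
	}, 2)
	for _, s := range shards {
		fmt.Println(len(s.files), "files,", s.bytes, "bytes")
	}
}
```

Each shard can then scan its whole files serially, so param.Filepath is never mutated concurrently.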

2. S3 FS rebuilt per List call

hive_partition.go:94-105: NewListDirFunc calls plan2.GetForETLWithType(param, prefix) on every recursive directory listing. For each call, the option string is re-encoded and a fresh S3FS is constructed. For N partitions at depth D that's O(N·D) S3FS constructions on top of O(N·D) LIST calls. The code already carries a TODO:

TODO: For S3 this re-creates an S3FS instance per List call; pre-build the FS once and reuse across recursive calls for better performance.

Please pre-build the FS once and reuse. The combination of #1 + #2 means S3-backed Hive tables are going to surprise users badly.
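A minimal sketch of the asked-for fix, assuming a closure that captures a single prebuilt filesystem handle instead of reconstructing one per List call. FileService, newListDirFunc, and the counter are illustrative stand-ins, not the actual fileservice API:

```go
package main

import (
	"fmt"
	"strings"
)

// FileService stands in for the handle that GetForETLWithType would
// build (e.g. an S3FS). Illustrative interface only.
type FileService interface {
	List(prefix string) []string
}

type fakeFS struct{ entries map[string][]string }

// newFakeFS bumps a construction counter so reuse is observable.
func newFakeFS(builds *int, entries map[string][]string) *fakeFS {
	*builds++
	return &fakeFS{entries: entries}
}

func (f *fakeFS) List(prefix string) []string { return f.entries[prefix] }

// newListDirFunc prebuilds the FS once; the returned closure reuses it
// for every recursive List call rather than rebuilding per call.
func newListDirFunc(builds *int, entries map[string][]string) func(prefix string) []string {
	fs := newFakeFS(builds, entries) // built exactly once
	return func(prefix string) []string {
		return fs.List(prefix) // reused across recursion
	}
}

func main() {
	builds := 0
	entries := map[string][]string{
		"warehouse/":           {"warehouse/year=2024/"},
		"warehouse/year=2024/": {"warehouse/year=2024/month=01/"},
	}
	list := newListDirFunc(&builds, entries)
	var walk func(prefix string)
	walk = func(prefix string) {
		for _, e := range list(prefix) {
			fmt.Println(e)
			if strings.HasSuffix(e, "/") {
				walk(e)
			}
		}
	}
	walk("warehouse/")
	fmt.Println("FS constructions:", builds)
}
```

With the closure, N partitions at depth D still cost O(N·D) LIST calls, but only one FS construction.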

3. Partition directory names are not sanitized against traversal

ParseHivePartitionSegment (hive_partition.go:143-151) accepts any bytes in Value — including /, .., NUL, and newlines. The only rejection is % for URL-encoded content. Later, path.Join(prefix, entry.Name) (:303) lexically collapses the .. segments. On local FS, a directory literally named year=.. (or any attacker-plantable layout underneath a tenant-readable base path) walks out of the intended subtree.

The directory name comes from fileservice.DirEntry, which reflects the real filesystem entry — it is attacker-reachable on any writable path that a different principal can manipulate. Please reject control bytes, /, .., and NUL in both Key and Value. An explicit allow-list is safer than an ad-hoc reject list.
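An allow-list-style validator along these lines might look as follows. parseSegment and validSegmentPart are hypothetical names, and the exact rejected byte set is an assumption about what the reviewer is asking for, not the PR's implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// validSegmentPart checks one side of a Hive partition segment
// (key or value): reject empty parts, "." and ".." (path traversal),
// path separators, NUL/control bytes, and '%' (URL-encoded content).
func validSegmentPart(s string) bool {
	if s == "" || s == "." || s == ".." {
		return false
	}
	for _, r := range s {
		if r < 0x20 || r == 0x7f { // control bytes, incl. NUL and newline
			return false
		}
		switch r {
		case '/', '\\', '%':
			return false
		}
	}
	return true
}

// parseSegment splits "key=value" and validates both sides before the
// value is ever joined onto a filesystem path.
func parseSegment(seg string) (key, value string, err error) {
	k, v, ok := strings.Cut(seg, "=")
	if !ok || !validSegmentPart(k) || !validSegmentPart(v) {
		return "", "", fmt.Errorf("invalid hive partition segment: %q", seg)
	}
	return k, v, nil
}

func main() {
	for _, seg := range []string{"year=2024", "year=..", "country=US%2FCA", "k=a/b"} {
		k, v, err := parseSegment(seg)
		fmt.Println(seg, "->", k, v, err == nil)
	}
}
```

Validating before path.Join means a directory named year=.. is rejected at parse time rather than silently escaping the base prefix.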

🟡 Please address or acknowledge

4. "Hive compatibility" is narrower than the title claims

  • URL-encoded partition values are rejected (hive_partition.go:269-272). Spark / Hive / Trino write country=US%2FCA verbatim whenever a partition value contains /, :, % or other filesystem-unsafe bytes. Datasets using such values cannot be read. The test comment at hive_partition_external_table.sql acknowledges this as "P0 known limitation" — but if it's P0 and known-broken, the feature title overclaims compatibility.
  • DATE / TIMESTAMP / FLOAT / DECIMAL partitions do not prune (hive_partition.go:419-431 canPruneType). WHERE dt = '2024-01-01' lists every partition. Hive workloads are predominantly date-partitioned; this materially regresses the most common pruning case.
  • Predicate pruning supports only = and IN with literals (hive_partition.go:586-595). >, >=, BETWEEN, !=, NOT IN, LIKE, CAST(...), and any OR combination all fall back to full listing. Again, time-range queries are the typical Hive access pattern.

I'm not asking for all three to be fixed in this PR, but please either (a) scope the title/release notes to what actually works, or (b) file explicit follow-up issues and link them. Right now "support Hive-style partitioned Parquet" sets a wrong expectation.
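For reference, the = / IN pruning the PR does support reduces to a set-membership filter over discovered partition directories (= being a one-element IN). A sketch with hypothetical names:

```go
package main

import "fmt"

// partition maps key -> value parsed from a Hive path segment chain,
// e.g. "year=2024/month=01" -> {year: 2024, month: 01}. Hypothetical.
type partition map[string]string

// pruneEq keeps partitions whose value for col is in the allowed set,
// the only predicate shapes pruned here. Partitions that lack the
// column are kept: the row-level filter re-checks them, which is the
// double-filtering safety the review praises below.
func pruneEq(parts []partition, col string, allowed map[string]bool) []partition {
	var kept []partition
	for _, p := range parts {
		v, ok := p[col]
		if !ok || allowed[v] {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	parts := []partition{
		{"year": "2023"}, {"year": "2024"}, {"year": "2025"},
	}
	// WHERE year IN ('2024','2025') prunes to 2 of 3 partitions.
	kept := pruneEq(parts, "year", map[string]bool{"2024": true, "2025": true})
	fmt.Println(len(kept))
}
```

Anything outside this shape (ranges, OR, CAST) falls back to listing every partition, which is the gap the item above describes.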

5. Cross-partition schema evolution is untested and probably broken

Hive datasets routinely accumulate parquet files with drifting schemas (added/dropped columns across partition directories). The code caches the schema from the first opened file and findColumnIgnoreCase is per-file, so behavior depends on which file happens to be scanned first. Zero BVT cases for add-col / drop-col across partitions. Please add a test and document the behavior (fail hard / fill NULL / skip column) explicitly.

6. Rolling upgrade within 3.0-dev

parsers/tree/update.go:4511-4525 extends ExParamConst with HivePartitioning/HivePartitionCols/HivePartColTypes. These are serialized into Createsql via json.Marshal at DDL time (build_ddl.go:930) and re-parsed on every compile on the CN that runs the query (compile.go:1573). A CN on the older code inside the 3.0-dev window will json.Unmarshal successfully (unknown fields are ignored), miss the Hive flags, and route the table through generic ReadDir. For a Hive base path like /warehouse/sales/ with no glob characters, ReadDir (utils.go:2084-2133) returns an empty file list, so the query returns zero rows rather than wrong rows — still a visible break during mixed deployment. Please guard with a feature flag or version check, or document the ordering constraint for the 3.0 cutover.

7. A single bad parquet kills the whole query

parquet.go + scanParquetFile returns an error out of the Hive loop if newParquetHandler fails on any file (0-byte, truncated, _temporary residue). Production Hive datasets carry broken files as a matter of course; Spark/Trino tolerate them with per-file skip + log. Please at minimum add a continue-on-error path for the Hive reader, matching what CSV external already does in some configs.

8. Hive keywords added to DDL without a feature flag

build_ddl.go:939 + utils.go:1673-1829 (parseHiveOptionKV). Any existing external-table DDL that happens to use hive_partitioning / hive_partition_columns as option keys for unrelated reasons changes semantics silently. Previously these would be rejected at the "keyword not supported" default. Low probability but zero rollback. Either (a) gate behind a server-level feature flag for the first release cycle or (b) add a changelog entry specifically flagging this.

🟢 Follow-ups (fine to defer)

  • Partition discovery is single-goroutine and serial (hive_partition.go:258-322). At depth 3 with 99-wide fanout you exhaust maxListCalls=10000. With a worker pool + LIST prefetching this is easily 10× better on S3. Make the two constants (maxListCalls, maxPartitionCount) configurable while you're at it.
  • matchInt / matchUint strip leading zeros (hive_partition.go:434, :451) so year=007 matches year=7, but country='CN' on varchar requires exact match. Two partition columns on the same dataset can have inconsistent pruning semantics. Documented behavior would help.
  • extractVecValues passes nil mempool to vec.Free(nil) (hive_partition.go:668). Other sites typically use proc.Mp(). Please verify Vector.Free semantics on nil mempool are intended; at minimum add a comment pinning the expectation.
  • SHOW CREATE TABLE drops case on partition column names (build_ddl.go:4625-4627 lowercases before storing). 'hive_partition_columns'='Year,Month' round-trips as 'year,month', breaking case-sensitive consumers of SHOW CREATE.
  • Float partition values have no NaN/Inf guard (hive_partition_fill.go:1911,1918). amount=inf/data.parquet silently fills +Inf.
  • HivePartColType embedded in tree package (parsers/tree/update.go:4519). Layering workaround to avoid importing pkg/pb/plan, but now tree knows plan type codes. A TODO with the plan-side extraction target would make this less load-bearing.
  • isHivePartitionCol linear scan per column (hive_partition_fill.go:1730-1741). Fine for small schemas, quadratic for wide tables.
  • Placeholder imports in tests: var _ = catalog.ExternalFilePath / var _ = function.EQUAL / var _ = math.MaxInt32 in hive_partition_test.go:3890-3891 and _coverage_test.go:1692. Just remove the imports.
  • Copyright headers mix 2024 (hive_partition.go) and 2026 (hive_partition_coverage_test.go, compile/hive_partition_test.go). Cosmetic.
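The leading-zero asymmetry flagged in the matchInt / matchUint follow-up (year=007 matching year=7 while varchar stays exact-match) falls out of parsing numeric partition values before comparing. A sketch with a hypothetical helper, not the PR's actual code:

```go
package main

import (
	"fmt"
	"strconv"
)

// matchIntValue mirrors the described behavior for integer partition
// columns: the directory value is parsed, so "007" compares equal to 7.
// String columns, by contrast, compare bytes exactly, which is why two
// columns on the same dataset can have different pruning semantics.
func matchIntValue(dirValue string, want int64) bool {
	v, err := strconv.ParseInt(dirValue, 10, 64)
	return err == nil && v == want
}

func main() {
	fmt.Println(matchIntValue("007", 7)) // leading zeros stripped by parsing
	fmt.Println(matchIntValue("7", 7))
	fmt.Println("CN" == "cn") // varchar comparison stays exact
}
```

Documenting this per-type difference (or normalizing directory values once at discovery time) would make pruning semantics consistent across columns.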

What's genuinely good (for balance)

  • Split of hive_partition.go (discovery + pruning) vs hive_partition_fill.go (constant-vector fill at scan time) is clean.
  • ClassifyFilters keeping partition filters in both partitionFilters and rowFilters for double-filtering safety (hive_partition.go:501-504) is a good defensive choice.
  • Constant-vector fill path (SetConstFixed / SetConstBytes) avoids per-row work in the hot loop. refreshPartitionValues runs once per file open, not per batch.
  • extractVecValues defensively bounds-checks the binary envelope before UnmarshalBinary (hive_partition.go:738-760).
  • The filePathColSet comment explaining why STATEMENT_ACCOUNT is excluded is exemplary.
  • BVT covers 11 sections including real negative cases with exact error strings, and the .result shows honest float rounding rather than pretty-printed.

Happy to chat through any of the 🔴 items. The core approach is right; the asks above are about not shipping the announcement until the implementation matches it.

@iamlinjunhong
Contributor Author

> re-parsed on every compile on the CN that runs the query (compile.go:1573). A CN on the older code inside the 3.0-dev window will json.Unmarshal successf…
addressed the three blocking items in the latest commit.

  1. Hive external scans are no longer forced into a single serial scope. Instead, Hive now uses a file-level fanout path after partition discovery. The discovered parquet files are sharded by file size across available CN slots for S3, and each shard scans whole files with Parallel=false internally. This avoids the generic parallel-read path mutating param.Filepath, while preserving Extern.Filepath as the Hive base path for partition value extraction.

  2. S3 partition discovery no longer rebuilds the ETL/S3 filesystem on every recursive List call. NewListDirFunc now prebuilds the S3 ETL filesystem once and derives the per-prefix read path for subsequent recursive lists.

  3. Hive partition directory segments are now validated before use. Partition keys/values reject path separators, . / .., NUL/control characters, and % URL-encoded names. I added unit coverage for parser rejection and discovery rejection.

also removed the placeholder test imports and added a comment documenting why vec.Free(nil) is intentional after UnmarshalBinary.

For the non-blocking compatibility items: URL-encoded partition values remain explicitly unsupported in P0; pruning is still limited to = / IN / AND for the currently prunable types, with row-level filters preserving correctness for range, OR, CAST, DATE/TIMESTAMP/DECIMAL, etc. I agree the PR title/release notes should scope this as Hive-style partitioned Parquet P0 rather than full Hive compatibility. Cross-partition schema evolution is also outside P0; current behavior is “extra parquet columns ignored, missing selected physical columns fail with column not found.” Rolling upgrade and bad-parquet skip behavior should be documented or handled in follow-ups; I did not add silent bad-file skipping because that can hide data loss without an explicit option.

@Ariznawlll
Contributor

Performance Benchmark Results

Benchmarked this PR (commit ab6058a) at two scales using a synthesized TPC-DS-like catalog_returns Parquet dataset. Hardware: Apple M-series laptop, local SSD.

Scope

  • SMALL: 500 partitions × 10,000 rows = 5M rows (~150 MB on disk)
  • MEDIUM: 2,000 partitions × 50,000 rows = 100M rows (~3 GB on disk)
  • LARGE (10,000 partitions): skipped, ran out of local disk (~30 GB needed)
  • StarRocks cross-engine comparison: intended but blocked — SR's allin1-ubuntu image BE segfaults on both arm64 and rosetta-amd64 on Apple Silicon. Would need a remote SR cluster to re-attempt.

All queries run against a single-CN local MatrixOne with PR binary. Warmup = 1 run, timed = 3 runs, median reported.

Results

| Query | SMALL (500 part) | MEDIUM (2000 part) | Comment |
|---|---|---|---|
| Q1 full scan | 313 ms | 12.3 s | Baseline throughput |
| Q2 single-partition = | 26 ms | 35 ms | Near-constant regardless of total partitions — pruning works |
| Q3 IN (5 partitions) | 29 ms | 59 ms | ~2× Q2, scales linearly with N matched partitions |
| Q4 BETWEEN range (docs say not pruned) | 79 ms | 521 ms | Scales with total data, as documented |
| Q5 GROUP BY partition_col | 96 ms | 951 ms | Appears to touch files; may have room to be list-only |
| Q6 JOIN + IN pushdown | 55 ms | 90 ms | ✅ IN list pushed into pruning |
| Q7 COUNT(DISTINCT partition_col) | 99 ms | 795 ms | Similar to Q5 — opens files unnecessarily? |

Key finding

Partition pruning works as advertised. Q2 latency is ~flat from SMALL to MEDIUM (26ms → 35ms) even as total partition count went 4×. This is the main promise of the feature and it holds.

Side effects (filed separately)

While running the benchmark I hit two issues that block the advertised usage pattern:

  1. [Bug]: s3option + format='parquet' rejected by InitS3Param (blocks hive_partitioning over S3) #24341 — s3option{... 'format'='parquet' ...} is rejected by InitS3Param. The entire S3 path in the PR's user guide is currently unusable; only local infile{} + parquet works. Had to fall back to local filesystem for this benchmark.
  2. [Bug]: PR #24329 regression — DECIMAL columns can no longer be loaded from Parquet #24343 — DECIMAL columns can no longer be loaded from Parquet on this branch. The PR's parquet.go rewrite (−1489 / +298 lines) removed the T_decimal64/128/256 mappers that exist on main. Had to rewrite queries to avoid decimal aggregates.

Suggestions before merge

Reproducibility

All benchmark scripts, data generator, DDL and queries at /tmp/mo_hive_perf/ locally. Can share if useful.


@iamlinjunhong
Contributor Author

> Performance Benchmark Results …

fixed #24341, #24343

Contributor

@XuPeng-SH XuPeng-SH left a comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for 3.0-dev. The three blockers from my earlier review were addressed in 1f7e8d3:

  1. File-level fanout for Hive tables (compileExternScanHiveFileFanout) replaces the forced-serial path; covered by TestCompileExternScanHiveFileFanout.
  2. S3FS is prebuilt once in NewListDirFunc and reused across recursive List calls via deriveHiveListReadPath.
  3. ParseHivePartitionSegment rejects '.', '..', '/', '\', NUL, control bytes, and '%'; DiscoverHivePartitions re-checks '%' at the List boundary.

NaN/Inf guards are in place in parseIntWithFloatFallback / parseUintWithFloatFallback. extractVecValues bounds-checks the binary envelope before UnmarshalBinary and safeVarlenaBytes guards area slicing. BVT coverage is strong (11 sections, honest negative cases).

CI is green including Upgrade Compatibility. Suggest squash-merging to collapse the four "fix" commits into one feature commit.

Non-blocking follow-ups worth separate issues:

  • Forward-port refreshPartitionValues / fillVirtualColumns into reader_parquet.go when 3.0-dev → main (scanParquetFile is gone on main since a2ac9f2).
  • Prune DATE/TIMESTAMP partitions and range predicates (the common Hive access pattern).
  • Make maxListCalls / maxPartitionCount server-level tunables.
  • Parallelize discoverRecursive on S3.
  • Cross-partition schema evolution test + documented semantics.
  • continue-on-error for bad parquet files.
  • Add explicit json:"..." tags to the new ExParamConst fields before 3.0-dev data hardens around the default names.
  • Scope the PR title / release notes to "Hive-style partitioned Parquet (P0)" since URL-encoded values and date-partition pruning are intentionally not supported yet.

@mergify
Contributor

mergify Bot commented May 12, 2026

Merge Queue Status

  • Entered queue: 2026-05-12 02:56 UTC · Rule: release-3.0
  • Checks skipped · PR is already up-to-date
  • Merged: 2026-05-12 02:57 UTC · at e3946f62c539ce16103a26d917dc2067d8bf814b · squash

This pull request spent 22 seconds in the queue, including 4 seconds running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Utils CI (3.0) / Coverage
    • check-neutral = Matrixone Utils CI (3.0) / Coverage
    • check-skipped = Matrixone Utils CI (3.0) / Coverage
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI (3.0) / SCA Test on Ubuntu/x86
    • check-neutral = Matrixone CI (3.0) / SCA Test on Ubuntu/x86
    • check-skipped = Matrixone CI (3.0) / SCA Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI (3.0) / UT Test on Ubuntu/x86
    • check-neutral = Matrixone CI (3.0) / UT Test on Ubuntu/x86
    • check-skipped = Matrixone CI (3.0) / UT Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI (3.0) / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-neutral = Matrixone Compose CI (3.0) / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-skipped = Matrixone Compose CI (3.0) / multi cn e2e bvt test docker compose(Optimistic/PUSH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI (3.0) / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-neutral = Matrixone Compose CI (3.0) / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-skipped = Matrixone Compose CI (3.0) / multi cn e2e bvt test docker compose(PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI (3.0) / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-neutral = Matrixone Standlone CI (3.0) / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-skipped = Matrixone Standlone CI (3.0) / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI (3.0) / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-neutral = Matrixone Standlone CI (3.0) / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-skipped = Matrixone Standlone CI (3.0) / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI (3.0) / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-neutral = Matrixone Standlone CI (3.0) / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-skipped = Matrixone Standlone CI (3.0) / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Upgrade CI (3.0) / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-neutral = Matrixone Upgrade CI (3.0) / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-skipped = Matrixone Upgrade CI (3.0) / Compatibility Test With Target on Linux/x64(LAUNCH)


Labels

size/XXL Denotes a PR that changes 2000+ lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants